RWTH OCR: A Large Vocabulary Optical Character Recognition System for Arabic Scripts

Authors

  • Philippe Dreuw
  • David Rybach
  • Georg Heigold
  • Hermann Ney
Abstract

We present a novel large-vocabulary OCR system, which implements a confidence- and margin-based discriminative training approach for model adaptation of an HMM-based recognition system to handle multiple fonts, different handwriting styles, and their variations. Most current HMM approaches are HTK-based systems which are maximum-likelihood (ML) trained and which try to adapt their models to different writing styles using writer adaptive training, unsupervised clustering, or additional writer-specific data. Here, discriminative training based on the Maximum Mutual Information (MMI) and Minimum Phone Error (MPE) criteria is used instead. For model adaptation during decoding, an unsupervised confidence-based discriminative training within a two-pass decoding process is proposed. Additionally, we use neural-network-based features extracted by a hierarchical multi-layer-perceptron (MLP) network either in a hybrid MLP/HMM approach or to discriminatively retrain a Gaussian HMM system in a tandem approach. The proposed framework and methods are evaluated for closed-vocabulary isolated handwritten word recognition on the IfN/ENIT Arabic handwriting database, where the word-error-rate is decreased by more than 50% relative compared to an ML-trained baseline system. Preliminary results for large-vocabulary Arabic machine-printed text recognition tasks are presented on a novel publicly available newspaper database.

RWTH Aachen University, Human Language Technology and Pattern Recognition, Ahornstr. 55, D-52056 Aachen, Germany. Tel.: +49-241-80-21613, Fax: +49-241-80-22219, e-mail: @cs.rwth-aachen.de
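The abstract reports its improvement as a relative reduction in word-error-rate (WER). As a minimal sketch of how such a relative figure is computed, the Python snippet below uses a hypothetical helper, `relative_wer_reduction`, and made-up WER values; neither the function name nor the numbers come from the paper.

```python
def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical figures for illustration only (NOT results from the paper).
baseline = 20.0  # WER in percent of an assumed ML-trained baseline
improved = 9.0   # WER in percent of an assumed discriminatively trained system

reduction = relative_wer_reduction(baseline, improved)
print(f"relative WER reduction: {reduction:.0%}")
```

With these assumed values the absolute drop is 11 percentage points, while the relative reduction is 11 / 20, i.e. above the 50% mark the abstract refers to.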

Similar Resources

Optical Character Recognition

Optical Character Recognition (OCR) is one of the challenging areas of pattern recognition. It gained popularity among the research community due to its vast application potential. Extensive research has been done on OCR, as evidenced by the large number of research articles published in the literature during the last few decades. Most of the research works reported in this area are for Roman, Chine...

Kannada Character Recognition System A Review

Intensive research has been done on optical character recognition (OCR), and a large number of articles have been published on this topic during the last few decades. Many commercial OCR systems are now available on the market, but most of these systems work for Roman, Chinese, Japanese and Arabic characters. There is not a sufficient number of works on Indian language character recognition especial...

Probabilistic sequence models for image sequence processing and recognition

This PhD thesis investigates the image sequence labeling problems optical character recognition (OCR), object tracking, and automatic sign language recognition (ASLR). To address these problems we investigate which concepts and ideas can be adopted from speech recognition to these problems. For each of these tasks we propose an approach that is centered around the approaches known from speech r...

An Arabic optical character recognition system using recognition-based segmentation

Optical character recognition (OCR) systems improve human-machine interaction and are widely used in many areas. The recognition of cursive scripts is a difficult task as their segmentation suffers from serious problems. This paper proposes an Arabic OCR system, which uses a recognition-based segmentation technique to overcome the classical segmentation problems. A newly developed Arabic word segm...

Lexicon Reduction for Urdu/Arabic Script Based Character Recognition: A Multilingual OCR

Arabic script character recognition is a challenging task due to the complexity of the script and the huge number of ligatures. We present a method for the development of multilingual Arabic script OCR (Optical Character Recognition) and lexicon reduction for Arabic script and its derivative languages. The objective of the proposed method is to overcome the large dataset Urdu and similar scripts by using...

Publication date: 2011